Abstract

We present a hierarchical maximum-margin clustering method for
unsupervised data analysis. Our method extends beyond flat
maximum-margin clustering, and performs clustering recursively in a
top-down manner. We propose an effective greedy splitting criterion
for selecting which cluster to split next, and employ regularizers
that enforce feature sharing/competition for capturing data
semantics. Experimental results obtained on four standard datasets
show that our method outperforms flat and hierarchical clustering
baselines, while forming clean and semantically meaningful cluster
hierarchies.

1. Introduction 

Clustering is an important topic in machine learning that, after
decades of research, remains challenging and actively studied. It
aims to group instances together based on their underlying
similarity in an unsupervised manner. Interest in clustering is
sustained by its widespread applicability in data analysis,
visualization, computer vision, information retrieval, and natural
language processing. Popular clustering methods include k-means
clustering (Lloyd, 1982) and spectral clustering (Ng et al., 2001).

Recent progress in maximum-margin methods has led to the development
of maximum-margin clustering (MMC) techniques (Xu et al., 2004),
which aim to learn both the hyperplanes that separate clusters of
data and the label assignments of instances to the clusters. MMC
outperforms traditional clustering methods in many cases, largely
due to the discriminative margin separation criterion imposed among
clusters.

However, MMC also has limitations. First, MMC is not particularly
efficient. While efficient MMC methods have been proposed (Zhang
et al., 2007; Zhao et al., 2008), even in such cases the time
complexity is at least linear or quadratic with respect to the
number of samples and clusters. This scalability issue is a
significant problem when considering the scale of modern datasets.
Second, MMC has difficulty identifying clusters with small margins,
which are particularly useful for fine-grained data. Consider
clustering images of commercial vehicles. In such data the major
source of dissimilarity among samples is the viewpoint and this is
what MMC is likely to focus on. The variations in the make of the
vehicle, which are semantically more meaningful, would result in
more local fine-grained differences and may be ignored by MMC's
flat-clustering criterion.

Hierarchical clustering methods, which are typically based on a tree
structure, have been extensively studied for their benefits over
their flat clustering counterparts. These hierarchical clustering
methods can discover hierarchical structures in data that better
represent many real-world data distributions. Computationally,
hierarchical clustering methods are also often more efficient,
because one can reduce a single large clustering problem into a set
of smaller sub-problems to be recursively solved. Since within each
sub-problem the data only needs to be clustered into a small number
of clusters, and for lower levels of the hierarchy only a small
subset of the data participates in each clustering step, this
procedure tends to be considerably more efficient.

To leverage such benefits, we propose a hierarchical extension to
MMC that recursively performs k-way clustering in a top-down manner.
However, instead of naively performing MMC at each clustering step,
we further leverage the observation from human-defined taxonomies
that each grouping/splitting decision typically focuses on different
features of the data.

Suppose, again, that we want to cluster different types of
commercial vehicles. Assuming we can cluster the data
hierarchically, it is sensible to assume that first we should
cluster the data based on the vehicle type (e.g., truck, SUV,
sedan). Once we know which sub-group each instance belongs to, we
may want to employ other criteria to separate them, e.g., according
to the price range or the make. We want to leverage a similar
intuition to learn clusters that focus on maximizing the margin
along different directions at different levels in the hierarchy.
Here, directions are defined by subsets of features from the much
larger feature vectors describing each instance. More specifically,
we employ regularization that allows clusters to group and compete
for the features at different levels. Such regularization has been
made popular in semantic supervised learning in recent years (Xiao
et al., 2011; Hwang et al., 2011), but here we apply the idea in an
unsupervised hierarchical clustering framework.

We test our hierarchical maximum-margin clustering (HMMC) method on
several image datasets, and show that HMMC is able to outperform
flat clustering methods like MMC. More significantly, it is able to
discover clean and semantically meaningful cluster hierarchies,
outperforming other hierarchical clustering alternatives.

Our contributions are threefold: (i) we present a novel hierarchical
clustering algorithm based on maximum-margin clustering with an
effective greedy splitting criterion for selecting which cluster to
split next, (ii) we employ regularization that enforces feature
sharing/competition to learn clusters that can focus on important
features during clustering, and (iii) we empirically validate that
our HMMC can learn semantically meaningful clusters without any
human supervision.

2. Related Work 

Maximum-margin clustering: MMC was first proposed by Xu et al.
(2004). It is a maximum-margin method for clustering, analogous to
support vector machines (SVMs) for supervised learning problems,
that learns both the maximum-margin hyperplane for each cluster and
the clustering assignment of instances to clusters. Since this joint
learning results in a non-convex formulation, unlike SVMs, it is
often solved by a semidefinite relaxation (Xu et al., 2004;
Valizadegan & Jin, 2006) or alternating optimization (Zhang et al.,
2007). While most of the MMC methods focus on efficient optimization
of the non-convex problems, the MMC formulation was also extended to
handle the case of multi-cluster clustering problems (Xu &
Schuurmans, 2005; Zhao et al., 2008), and to include latent
variables (Zhou et al., 2013).

Hierarchical clustering methods: Most hierarchical clustering
methods employ either top-down clustering strategies that
recursively split clusters into fine-grained clusters, or bottom-up
clustering strategies that recursively group the smaller clusters
into larger ones (Manning et al., 2008). Our method is a top-down
clustering method, and the canonical example of such a method is
hierarchical k-means clustering, which performs k-means recursively
in a top-down manner (e.g., the bisecting k-means method (Steinbach
et al., 2000)). Variations on this idea include hierarchical
spectral clustering (e.g., PDDP (Boley, 1998)) which performs the
hierarchical clustering on the graph Laplacian of the similarity
matrix, and model-based hierarchical clustering (Vaithyanathan &
Dom, 2000; Castro et al., 2004; Goldberger & Roweis, 2004) which
fits probabilistic models at each split. To the best of our
knowledge, this is the first work using a maximum-margin approach
for hierarchical clustering.

Sharing/competing for features: Regularization methods that promote
certain structures in the parameter or feature spaces have been
extensively studied in the context of regression, classification,
and sparse coding. The group lasso (Meier et al., 2008) employs a
mixed $\ell_1/\ell_2$-norm to promote sparsity among groups of
features, identifying the groups that are most important for the
task. This has been applied to classification tasks like multi-task
learning and multi-class classification, where it encourages the
classifiers to share features across the tasks/classes. A
generalization of the group lasso is the sparse group lasso
(Friedman et al., 2010), which further encourages sparsity within
each individual model.

However, in some cases it makes more sense to have models fit to
exclusive sets of features. The exclusive lasso (Zhou et al., 2010)
encourages two models to use different features, by minimizing the
$\ell_2$-norm of their $\ell_1$-norms. This discourages different
models from having non-zero values along the same feature
dimensions, encouraging each model to use features that are
exclusive to its task. Orthogonal transfer (Xiao et al., 2011)
focuses on such exclusiveness between parent and child models in a
taxonomy, and enforces the exclusivity through "orthogonal
regularization", which minimizes the inner product of the SVM
weights for parent and child nodes. The tree of metrics approach
(Hwang et al., 2011) employs similar intuition, but learns
Mahalanobis metrics instead of SVM weights, and focuses on selecting
sparse and disjoint features. Tree-guided group lasso (Kim & Xing,
2010) employs both sharing and exclusive regularizations, to promote
sharing between the labels that belong to the same parent, while
also enforcing exclusive fitting between them, guided by a
predefined taxonomy. These methods consider supervised learning
scenarios, while our method utilizes the grouping and exclusive
regularizers for unsupervised clustering.

3. Hierarchical Maximum-Margin Clustering 

We propose a hierarchical clustering method based on the
maximum-margin criterion. We aim to find groups of data points with
a large separation between them, while forming a cluster hierarchy.
The proposed method builds on the standard flat MMC clustering (Xu
et al., 2004), but extends MMC in the following two aspects: (i) we
introduce regularizers to encourage the different layers of the
hierarchy to focus on the use of different feature subsets, and
(ii) we build the hierarchy iteratively from coarse clusters to
fine-grained clusters (rather than forming all clusters in one
split) using a greedy top-down algorithm with a novel splitting
criterion. We first introduce the HMMC formulation in this section,
and then describe the optimization method in Sec. 4.

Suppose there are $T$ non-leaf nodes $\{n_t\}_{t=1}^{T}$ in the
learned hierarchy. We use $\mathcal{D}_t$ to denote the data on
$n_t$, and HMMC splits $\mathcal{D}_t$ into $K_t$ clusters by
learning a linear model $\mathbf{w}_{tk}$ for each cluster $k$. We
collect the $K_t$ cluster models in
$\mathbf{w}_t = \{\mathbf{w}_{tk}\}_{k=1}^{K_t}$. We split the data
$\mathcal{D}_t$ on node $n_t$ using the MMC idea: finding a
clustering assignment such that the resultant margin between
clusters is maximal over all possible assignments. By summing over
all the non-leaf splits, our global HMMC objective is formulated as:

$$
\begin{aligned}
\min_{\mathbf{w},\,\mathbf{y},\,\xi \ge 0} \quad
  & \sum_{t=1}^{T} \Big( \alpha\, G(\mathbf{w}_t) + \beta\, E(\mathbf{w}_t)
    + \frac{1}{|\mathcal{D}_t|} \sum_{\mathbf{x}_i \in \mathcal{D}_t}
      \sum_{y \neq y_{ti}} \xi_{tiy}^{2} \Big) \\
\text{s.t.} \quad
  & \mathbf{w}_{t y_{ti}}^{\top} \mathbf{x}_i - \mathbf{w}_{ty}^{\top} \mathbf{x}_i
    \ge 1 - \xi_{tiy}, \quad \forall \mathbf{x}_i \in \mathcal{D}_t,\ y \neq y_{ti} \\
  & y_{ti} \in \{1, \dots, K_t\}, \quad \forall \mathbf{x}_i \in \mathcal{D}_t \\
  & L_t \le \sum_{\mathbf{x}_i \in \mathcal{D}_t} \Delta(y_{ti} = y) \le U_t,
    \quad \forall y \in \{1, \dots, K_t\}
\end{aligned}
\tag{1}
$$

where $\mathbf{w} = \{\mathbf{w}_t\}$ are the cluster model
parameters, $y_{ti}$ denotes the cluster label of an instance
$\mathbf{x}_i$ on node $n_t$, the $\xi$'s are slack variables to
allow margin violations, $G(\cdot)$ and $E(\cdot)$ are regularizers,
and $\alpha$ and $\beta$ are trade-off parameters. Our algorithm
uses MMC for each data split, where we enforce the maximum-margin
criterion by constraining the score of fitting $\mathbf{x}_i$ to its
assigned cluster to be sufficiently larger than to any other
cluster, using the squared hinge loss (whose smoothness simplifies
the optimization). The last constraint enforces the clusters to be
balanced, to avoid degenerate solutions with empty clusters and
infinite margins. Here $\Delta(\cdot)$ is an indicator function,
while $L_t$ and $U_t$ are the lower and upper bounds controlling the
size of the clusters. As suggested in (Zhou et al., 2013), we set
$L_t$ and $U_t$ to $0.9\,|\mathcal{D}_t|/K_t$ and
$1.1\,|\mathcal{D}_t|/K_t$ respectively, to achieve roughly balanced
clusters at each split. Note that HMMC jointly optimizes the model
parameters $\mathbf{w}$ and clustering assignments
$\mathbf{y} = \{y_{ti}\}$ over all splits.

The two regularizers $G(\mathbf{w}_t)$ and $E(\mathbf{w}_t)$ promote
learning of a semantically meaningful cluster hierarchy. These
regularizers encourage splitting on a sparse group of features at
each node, while encoding a preference towards using different
features at different levels of the hierarchy. While the grouping
and competition among features have proved useful for encoding
semantic taxonomies in supervised learning problems (Xiao et al.,
2011; Hwang et al., 2011), we apply these ideas for discovering
semantically meaningful cluster hierarchies in an entirely
unsupervised setting.

Group sparsity: In the hierarchy, we would like different splits to
focus on different subsets of features. Thus, in splitting a
non-leaf node, we encourage the clustering process to only use a
sparse set of relevant features. Considering that there are $K_t$
cluster models at node $n_t$, we enforce group sparsity over
different feature dimensions so that the $K_t$ models are using the
same subset of features. Formally, we have the following regularizer
on the split of $n_t$:

$$
G(\mathbf{w}_t) = \frac{1}{P K_t} \sum_{p=1}^{P}
  \sqrt{\sum_{k=1}^{K_t} w_{tk,p}^{2}},
\tag{2}
$$

where $P$ is the feature dimension, and $w_{tk,p}$ is the $p$-th
element in $\mathbf{w}_{tk}$. This term encodes that if a feature is
irrelevant, then it is zero-weighted in all the $K_t$ cluster
models.

Exclusive sparsity: We also want the cluster hierarchy to use
different subsets of features in different layers, so that we
consider different factors when traversing the hierarchy. In other
words, a split is expected to explore features that are different
from its ancestors and descendants, and thus the splits compete for
features at different layers. We will denote a node $n_t$'s
ancestors by $\mathcal{A}_t$, which formally is the set of nodes on
the path from the root to $n_t$. With this notation the exclusive
regularizer for node $n_t$ is defined by:

$$
E(\mathbf{w}_t) = \frac{1}{K_t\, |\mathcal{A}_t|\, P}
  \sum_{k=1}^{K_t} \sum_{n_a \in \mathcal{A}_t} \sum_{p=1}^{P}
  |w_{tk,p}| \cdot |w_{a k_a,p}|,
\tag{3}
$$

where $k_a$ indexes the child of $n_a$ ($n_a \in \mathcal{A}_t$) on
the path to $n_t$. Thus, $\mathbf{w}_{a k_a}$ is the parameter
vector for the ancestral cluster to which $n_t$ belongs. Eq. (3)
penalizes cooperation (using the same features) and encourages
competition (using different features) between a cluster model
$\mathbf{w}_{tk}$ and each of its ancestor models
$\{\mathbf{w}_{a k_a}\}_{n_a \in \mathcal{A}_t}$. The degree of
competition is calculated as the element-wise multiplication of the
absolute weight values. Intuitively, this means that there is no
penalty if two models use different features, but using the same
features results in a high penalty. Consequently, minimizing the
exclusive sparsity as we split nodes will encourage nodes to use
features different from those used by their ancestors and
descendants. In (Xiao et al., 2011; Vervier et al., 2014), it is
shown that Eq. (3) becomes convex when combined with a sufficiently
large $\ell_2$-regularizer.
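To make the two regularizers concrete, the following minimal Python sketch evaluates a group-sparsity term and an exclusive-sparsity term for one node. It assumes the normalizations $1/(P K_t)$ and $1/(K_t |\mathcal{A}_t| P)$ for Eqs. (2) and (3); the function names and the list-of-lists weight layout are illustrative choices, not part of the original method description.

```python
import math

def group_sparsity(W):
    """G(w_t) for one node: W holds K_t weight vectors of length P.

    Assumed normalization: 1 / (P * K_t).
    """
    K, P = len(W), len(W[0])
    total = 0.0
    for p in range(P):
        # l2-norm across the K_t cluster models on feature p
        total += math.sqrt(sum(W[k][p] ** 2 for k in range(K)))
    return total / (P * K)

def exclusive_sparsity(W, ancestors):
    """E(w_t): `ancestors` lists the ancestral cluster models w_{a k_a}.

    Assumed normalization: 1 / (K_t * |A_t| * P). Returning 0 for the
    root (no ancestors) is an assumption of this sketch.
    """
    if not ancestors:
        return 0.0
    K, P = len(W), len(W[0])
    total = 0.0
    for w_k in W:
        for w_a in ancestors:
            # element-wise product of absolute weights
            total += sum(abs(w_k[p]) * abs(w_a[p]) for p in range(P))
    return total / (K * len(ancestors) * P)
```

With two cluster models that both use only the first feature, an ancestor that also uses the first feature is penalized, whereas an ancestor that uses only the second feature incurs no penalty, matching the intuition stated for Eq. (3).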

4. Optimization 

The objective of Eq. (1) is non-convex due to the unknown
hierarchical structure, and because we do not know the split on each
node that jointly optimizes $\mathbf{w}$ and $\mathbf{y}$. To solve
the problem, we propose a greedy top-down algorithm to build the
hierarchy (Sec. 4.1), and an alternating descent algorithm for
splitting a node (Sec. 4.2).

4.1. Building the Hierarchy 


We build the cluster hierarchy in a top-down manner, where the
challenge is to iteratively find the next leaf to split.
Algorithm 1 gives an overview of our greedy method. We start from
the root node $n_1$ containing all the data. Note that $n_1$ starts
as a leaf node since it has no children. Each iteration tries to
split the data on each leaf node $n_t$ (Step 4), and we define the
splitting score (Step 5) as:

$$
S(n_t) = \frac{\frac{1}{|\mathcal{D}_t|}
  \sum_{\mathbf{x}_i \in \mathcal{D}_t}
  \mathbf{w}_{t y_{ti}}^{\top} \mathbf{x}_i}
  {G(\mathbf{w}_t) + E(\mathbf{w}_t)}.
\tag{4}
$$

The splitting score measures how well, and how easily, the data on
node $n_t$ can be clustered. The numerator of Eq. (4) summarizes the
scores of fitting each instance to its assigned cluster. A high
value in the numerator indicates compact clusters where the
instances are well-fit by the assigned cluster models. The
denominator of Eq. (4) is the regularization term indicating the
complexity of the cluster models, where a small value implies a
simple model. Thus, a higher splitting score means the node is a
better candidate to be split.

The leaf node to split is chosen to greedily maximize the splitting
score (Step 6). We fix the cluster models on this node, mark it as a
non-leaf node, and move it to the hierarchy (Step 7). Moreover,
since we are splitting this node, we generate its child nodes
according to the clustering result and add the child nodes to the
leaf node set for the next iteration (Steps 8 to 11). We iterate
this process until the stopping criterion is satisfied, which could
test whether (i) a given number of leaf nodes has been found,
(ii) the sizes of all leaf nodes are sufficiently small, or
(iii) the hierarchy reaches a height limit. To speed up this
process, we cache the clustering result on each leaf node, so that
we do not have to rerun the clustering once the leaf node is
selected to grow the hierarchy.

Algorithm 1 HMMC: A greedy algorithm for building the hierarchy
Input: $n_1$ and $\mathcal{D}$   ▷ $n_1$ is the root node carrying all data in $\mathcal{D}$
Output: $\mathcal{H}$   ▷ the cluster hierarchy including all non-leaf nodes
 1: Initialize: $\mathcal{L} \leftarrow \{n_1\}$;   ▷ the current set of leaf nodes
 2: while the stopping criterion is not met do
 3:   for $n_t \in \mathcal{L}$ do
 4:     cluster the data on $n_t$;   ▷ cf. Sec. 4.2
 5:     compute the splitting score $S(n_t)$;   ▷ cf. Eq. (4)
 6:   $n^* \leftarrow \arg\max_{n_t \in \mathcal{L}} S(n_t)$;   ▷ greedily find the next split
 7:   $\mathcal{L} \leftarrow \mathcal{L} \setminus n^*$; $\mathcal{H} \leftarrow \mathcal{H} \cup n^*$;   ▷ move $n^*$ from $\mathcal{L}$ to $\mathcal{H}$
 8:   for each cluster in $n^*$ do
 9:     create a leaf node $n_c$ carrying the data in that cluster;
10:     link $n_c$ as a child of $n^*$;
11:     $\mathcal{L} \leftarrow \mathcal{L} \cup n_c$;   ▷ add $n_c$ to the current set of leaf nodes

4.2. Splitting A Node

The clustering on a given node $n_t$ is formulated as:

$$
\min_{\mathbf{w}_t,\,\mathbf{y}_t,\,\xi \ge 0} \quad
  \alpha\, G(\mathbf{w}_t) + \beta\, E(\mathbf{w}_t)
  + \frac{1}{|\mathcal{D}_t|} \sum_{\mathbf{x}_i \in \mathcal{D}_t}
    \sum_{y \neq y_{ti}} \xi_{tiy}^{2},
\tag{5}
$$

where we omit the constraints (from Eq. (1)) for brevity. Note that
the cluster models of the ancestors of $n_t$ have been fixed in the
greedy top-down learning process. Thus, the exclusive regularizer
$E(\mathbf{w}_t)$ becomes a weighted $\ell_1$-norm
(sparsity-inducing) regularizer on $\mathbf{w}_t$, where the weight
on each model parameter $w_{tk,p}$ is set based on the ancestor
nodes to
$\frac{1}{K_t |\mathcal{A}_t| P} \sum_{n_a \in \mathcal{A}_t} |w_{a k_a,p}|$.
Together with the group sparsity $G(\mathbf{w}_t)$, this yields a
weighted sparse group lasso regularizer, generalizing the sparse
group lasso regularizer of Friedman et al. (2010).
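The reduction to a weighted sparse group lasso can be written out explicitly. The short derivation below is a sketch that assumes the normalizations $1/(P K_t)$ for the group term and $1/(K_t |\mathcal{A}_t| P)$ for the exclusive term; the shorthand $\lambda_{E,p}$ and $\mathbf{w}_{t:,p}$ (the vector of the $K_t$ weights on feature $p$) is introduced here for illustration.

```latex
% Ancestor models are fixed, so their absolute weights act as constants:
\lambda_{E,p} \;=\; \frac{1}{K_t\,|\mathcal{A}_t|\,P}
                    \sum_{n_a \in \mathcal{A}_t} |w_{a k_a,p}| .

% Swapping the order of summation in the exclusive regularizer gives a
% weighted l1-norm over the entries of w_t:
E(\mathbf{w}_t) \;=\; \sum_{k=1}^{K_t} \sum_{p=1}^{P}
                      \lambda_{E,p}\, |w_{tk,p}|
                \;=\; \sum_{p=1}^{P} \lambda_{E,p}\,
                      \|\mathbf{w}_{t:,p}\|_1 .

% Combined with the group term, the full regularizer is a weighted
% sparse group lasso with one group per feature dimension:
\alpha\, G(\mathbf{w}_t) + \beta\, E(\mathbf{w}_t)
  \;=\; \sum_{p=1}^{P} \Big( \frac{\alpha}{P K_t}\,
        \|\mathbf{w}_{t:,p}\|_2
        + \beta\, \lambda_{E,p}\, \|\mathbf{w}_{t:,p}\|_1 \Big).
```

Features that ancestors weight heavily receive a large $\lambda_{E,p}$ and are therefore more strongly shrunk towards zero at the current node.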

Eq. (5) is still a non-convex problem due to the joint optimization
over $\mathbf{w}_t$ and $\mathbf{y}_t$. We use an alternating
descent algorithm to reach a solution. In each iteration we fix the
model parameters $\mathbf{w}_t$ and optimize $\mathbf{y}_t$ by
solving a clustering assignment problem, and then we update
$\mathbf{w}_t$ while keeping $\mathbf{y}_t$ fixed using a proximal
quasi-Newton algorithm (Lee et al., 2012; Schmidt, 2010). The
algorithm stops when the objective converges to a local optimum with
respect to these steps.

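Clustering assignment: With $\mathbf{w}_t$ fixed, the problem in
Eq. (5) turns out to be an assignment problem, which minimizes the
total cost for labeling all instances while maintaining balanced
clusters:

$$
\begin{aligned}
\min_{\mathbf{y}_t} \quad
  & \sum_{\mathbf{x}_i \in \mathcal{D}_t} \sum_{y=1}^{K_t}
    \Delta(y_{ti} = y) \cdot
    \overbrace{\sum_{y' \neq y} \big[ 1 - \mathbf{w}_{ty}^{\top} \mathbf{x}_i
      + \mathbf{w}_{ty'}^{\top} \mathbf{x}_i \big]_{+}^{2}}^{c_{tiy}} \\
\text{s.t.} \quad
  & y_{ti} \in \{1, \dots, K_t\}, \quad \forall \mathbf{x}_i \in \mathcal{D}_t \\
  & L_t \le \sum_{\mathbf{x}_i \in \mathcal{D}_t} \Delta(y_{ti} = y) \le U_t,
    \quad \forall y \in \{1, \dots, K_t\}
\end{aligned}
\tag{6}
$$

Figure 1. A sample MCF network. The edge settings are formatted as:
"[lower capacity, upper capacity], cost". See text for details.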

where $c_{tiy}$ is the cost for assigning an instance
$\mathbf{x}_i$ into a cluster $y$. Following (Zhou et al., 2013), we
could solve Eq. (6) by constructing an integer linear programming
(ILP) problem with $O(|\mathcal{D}_t| \cdot K_t)$ variables and
$O(|\mathcal{D}_t| + K_t)$ constraints. However, this ILP is
time-consuming since in the worst case the complexity of existing
ILP solvers is exponential in the number of variables. To
efficiently solve this problem, we formulate it as a minimum cost
flow (MCF) problem.

We re-write the clustering assignment as the problem of sending a
minimum cost flow through an appropriately designed network,
illustrated in Fig. 1. The flow capacity of an edge from the
starting node $s$ to an instance node $\mathbf{x}_i$ is set to 1
since we are assigning every instance into a cluster. This one unit
of flow is sent from $\mathbf{x}_i$ to a cluster node
$\mathbf{w}_{ty}$, to which the instance is assigned, at cost
$c_{tiy}$. Finally, each cluster node sends its receiving flows to
the end node $e$, where we limit the flow capacity in the range
$[L_t, U_t]$ to take the cluster balance constraints into account.
It can be shown that clustering the $|\mathcal{D}_t|$ instances into
$K_t$ clusters (under the cluster balance constraints) is equivalent
to sending $|\mathcal{D}_t|$ units of flow from $s$ to $e$, and the
optimal network flow corresponds to the minimum total cost of
Eq. (6). To find this optimal flow, we apply the capacity scaling
algorithm (Edmonds & Karp, 1972) implemented in the LEMON library
(Dezső et al., 2011), which is an efficient dual solution method
running in
$O(|\mathcal{D}_t| \cdot K_t \cdot \log(|\mathcal{D}_t| + K_t) \cdot \log(U_t \cdot |\mathcal{D}_t| \cdot K_t))$
complexity. In practice, our MCF solver speeds up the ILP solver in
(Zhou et al., 2013) by 10 to 100 times.
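The flow network described above can be illustrated with a small, self-contained Python sketch. It is not the capacity scaling algorithm from LEMON used in the paper: for brevity it swaps in successive shortest paths with Bellman-Ford, and it handles the lower bounds with a large negative cost on the first `lower` units of each cluster-to-sink edge (a standard big-M device, chosen here as an assumption). The function name and argument layout are hypothetical.

```python
def balanced_assignment(cost, lower, upper):
    """Assign each instance to one cluster at minimum total cost, with
    every cluster receiving between `lower` and `upper` instances.

    cost[i][k] is the cost of putting instance i into cluster k.
    Illustrative solver: successive shortest paths, not capacity scaling.
    """
    n, K = len(cost), len(cost[0])
    S, T, N = 0, n + K + 1, n + K + 2   # source, sink, node count
    graph = [[] for _ in range(N)]
    to, cap, cst = [], [], []

    def add_edge(u, v, c, w):
        # forward edge gets an even index, its reverse the next odd one
        graph[u].append(len(to)); to.append(v); cap.append(c); cst.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cst.append(-w)

    # big-M reward that dominates any difference in assignment cost
    big = 1.0 + 2.0 * sum(max(abs(c) for c in row) for row in cost)
    for i in range(n):
        add_edge(S, 1 + i, 1, 0.0)                    # one unit per instance
        for k in range(K):
            add_edge(1 + i, n + 1 + k, 1, cost[i][k])
    for k in range(K):
        add_edge(n + 1 + k, T, lower, -big)           # units up to the lower bound
        add_edge(n + 1 + k, T, upper - lower, 0.0)    # optional extra units

    inf = float("inf")
    for _unit in range(n):                            # send n units, one at a time
        dist, prev = [inf] * N, [-1] * N
        dist[S] = 0.0
        for _pass in range(N - 1):                    # Bellman-Ford relaxation
            changed = False
            for u in range(N):
                if dist[u] == inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cst[e] < dist[to[e]] - 1e-9:
                        dist[to[e]] = dist[u] + cst[e]
                        prev[to[e]] = e
                        changed = True
            if not changed:
                break
        if prev[T] == -1:
            raise ValueError("infeasible: upper bounds are too small")
        v = T
        while v != S:                                 # augment one unit along the path
            e = prev[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]

    labels = []
    for i in range(n):
        for e in graph[1 + i]:
            if e % 2 == 0 and cap[e] == 0:            # saturated instance-to-cluster edge
                labels.append(to[e] - (n + 1))
    if any(labels.count(k) < lower for k in range(K)):
        raise ValueError("infeasible: lower bounds cannot be met")
    return labels
```

For example, with three instances that prefer cluster 0 and one that prefers cluster 1, bounds of exactly two per cluster force one of the first three into cluster 1, whereas bounds of one to three leave every instance in its preferred cluster.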

Updating $\mathbf{w}_t$: With fixed $\mathbf{y}_t$, we solve for
$\mathbf{w}_t$ (a convex problem) using a proximal quasi-Newton
method (Lee et al., 2012; Schmidt, 2010). This method is designed to
efficiently minimize smooth losses with non-smooth but simple
regularizers, and on each iteration it computes a new estimate
$\mathbf{w}_t$ by solving:

$$
\begin{aligned}
\min_{\mathbf{w}_t} \quad
  & \alpha\, G(\mathbf{w}_t) + \beta\, E(\mathbf{w}_t)
    + H(\mathbf{w}_t^{old}) \\
  & + H'(\mathbf{w}_t^{old})^{\top} (\mathbf{w}_t - \mathbf{w}_t^{old})
    + \frac{1}{2s} \|\mathbf{w}_t - \mathbf{w}_t^{old}\|_B^2,
\end{aligned}
\tag{7}
$$

where $s$ is a step-size set using a backtracking line-search,
$H(\mathbf{w}_t^{old})$ is the squared hinge-loss (i.e., the last
term of Eq. (5) after using the constraints to eliminate the slack
variables) estimated with $\mathbf{w}_t^{old}$ from the fixed
$\mathbf{y}_t$, $H'(\mathbf{w}_t^{old})$ is the derivative of
$H(\mathbf{w}_t^{old})$ w.r.t. $\mathbf{w}_t^{old}$, and
$\|\mathbf{z}\|_B^2 = \mathbf{z}^{\top} B \mathbf{z}$ is a
divergence formed using the L-BFGS matrix $B$ (Byrd et al., 1994;
Nocedal, 1980).

A spectral proximal-gradient method is used to compute an
approximate minimizer of this objective. This algorithm requires the
proximal operator. For our weighted sparse group lasso regularizer,
we can show that solving this minimization problem involves a
two-step procedure. First, we incorporate the weighted
$\ell_1$-norm penalty by applying the soft-threshold operator
$w_{tk,p} = \frac{w_{tk,p}}{|w_{tk,p}|} \big[ |w_{tk,p}| - s \beta \lambda_E \big]_+$
to each model parameter individually, where
$\lambda_E = \frac{1}{K_t |\mathcal{A}_t| P} \sum_{n_a \in \mathcal{A}_t} |w_{a k_a,p}|$
are the weights coming from the ancestor models in
$E(\mathbf{w}_t)$. This operator returns 0 if $w_{tk,p} = 0$.
Second, we incorporate the group sparsity using the group-wise
soft-threshold operator
$\mathbf{w}_{t:,p} = \frac{\mathbf{w}_{t:,p}}{\|\mathbf{w}_{t:,p}\|_2} \big[ \|\mathbf{w}_{t:,p}\|_2 - s \alpha \lambda_G \big]_+$,
where $\mathbf{w}_{t:,p} = [w_{t1,p}, \dots, w_{tK_t,p}]^{\top}$ is
the grouping of $K_t$ cluster models on a feature dimension $p$, and
$\lambda_G = \frac{1}{P K_t}$ is the normalization term from
$G(\mathbf{w}_t)$. Note that this operator returns 0 if
$\mathbf{w}_{t:,p} = \mathbf{0}$.
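The two-step proximal procedure can be sketched in a few lines of plain Python. The names `soft_threshold`, `group_soft_threshold` and `prox_weighted_sparse_group_lasso` are illustrative, and passing the per-feature ancestor weights as a list `lam_E` and the group normalization as a scalar `lam_G` is an assumption about how the quantities would be supplied.

```python
import math

def soft_threshold(w, tau):
    """Shrink |w| by tau while keeping the sign; returns 0 when |w| <= tau."""
    mag = abs(w) - tau
    return math.copysign(mag, w) if mag > 0 else 0.0

def group_soft_threshold(group, tau):
    """Shrink the l2-norm of a whole group by tau; zero the group if norm <= tau."""
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= tau:
        return [0.0] * len(group)
    scale = (norm - tau) / norm
    return [scale * v for v in group]

def prox_weighted_sparse_group_lasso(W, lam_E, lam_G, s, alpha, beta):
    """Two-step proximal map for a K_t x P weight matrix W.

    lam_E[p] is the (assumed precomputed) ancestor weight on feature p,
    lam_G the normalization of the group term, s the step size.
    """
    K, P = len(W), len(W[0])
    # Step 1: entry-wise soft-threshold for the weighted l1 penalty
    out = [[soft_threshold(W[k][p], s * beta * lam_E[p]) for p in range(P)]
           for k in range(K)]
    # Step 2: group-wise soft-threshold over the K_t models on each feature
    for p in range(P):
        col = group_soft_threshold([out[k][p] for k in range(K)],
                                   s * alpha * lam_G)
        for k in range(K):
            out[k][p] = col[k]
    return out
```

Applying the entry-wise step before the group-wise step mirrors the order given in the text; a feature whose weights are small in every cluster model ends up exactly zero for the whole node.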

Convergence analysis: We now show that this alternating descent
algorithm converges to a local optimum. The optimization consists of
two alternating steps: updating the discrete $\mathbf{y}_t$ and the
continuous $\mathbf{w}_t$. In the $\mathbf{w}_t$ update, we fix the
clustering $\mathbf{y}_t$ and use a method that is guaranteed to
find a global optimum (Lee et al., 2012; Schmidt, 2010). The
$\mathbf{y}_t$ update (with $\mathbf{w}_t$ fixed) is NP-hard but we
can find a solution that guarantees improvement using MCF. Since
there is a finite number of possible assignments to $\mathbf{y}_t$,
the procedure guarantees convergence to a local minimum with respect
to updating $\mathbf{w}_t$ or $\mathbf{y}_t$.
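The greedy construction of Algorithm 1 can be sketched independently of the node-splitting method. In the Python sketch below, `split_fn` and `score_fn` are hypothetical callbacks standing in for the clustering of Sec. 4.2 and the splitting score of Eq. (4); the stopping criterion is a leaf-count limit, and the two toy helpers on 1-D points exist only to make the example runnable.

```python
def build_hierarchy(data, split_fn, score_fn, max_leaves):
    """Greedy top-down hierarchy construction in the spirit of Algorithm 1.

    split_fn(points) -> list of clusters (each a list of points)
    score_fn(points, clusters) -> splitting score (higher is better)
    """
    root = {"data": data, "children": []}
    leaves = [root]
    cache = {}  # id(leaf) -> (score, clusters); avoids re-clustering a leaf
    while len(leaves) < max_leaves:
        for leaf in leaves:
            if id(leaf) not in cache and len(leaf["data"]) > 1:
                clusters = split_fn(leaf["data"])
                cache[id(leaf)] = (score_fn(leaf["data"], clusters), clusters)
        candidates = [leaf for leaf in leaves if id(leaf) in cache]
        if not candidates:
            break  # nothing left that can be split
        best = max(candidates, key=lambda leaf: cache[id(leaf)][0])
        _, clusters = cache.pop(id(best))
        leaves = [leaf for leaf in leaves if leaf is not best]
        for cluster in clusters:
            child = {"data": cluster, "children": []}
            best["children"].append(child)
            leaves.append(child)
    return root

def collect_leaves(node):
    """Return the data lists stored at the leaves of the hierarchy."""
    if not node["children"]:
        return [node["data"]]
    return [d for child in node["children"] for d in collect_leaves(child)]

# Toy stand-ins (assumptions for illustration): split 1-D points at the
# largest gap, and use that gap as the splitting score.
def split_at_largest_gap(points):
    pts = sorted(points)
    i = max(range(len(pts) - 1), key=lambda j: pts[j + 1] - pts[j])
    return [pts[: i + 1], pts[i + 1:]]

def gap_score(points, clusters):
    return min(clusters[1]) - max(clusters[0])
```

Calling `build_hierarchy([0.0, 1.0, 10.0, 11.0, 30.0], split_at_largest_gap, gap_score, 3)` first separates the outlier at 30 and then splits the remaining points between 1 and 10, so the leaves hold `[0.0, 1.0]`, `[10.0, 11.0]` and `[30.0]`. The cache plays the role described in Sec. 4.1: a leaf is clustered once and the result is reused when that leaf is later selected.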

5. Experiments 

Datasets: We evaluate the performance of HMMC on four datasets from
two public image collections: Animal With Attributes (AWA) (Lampert
et al., 2009) and ImageNet (Deng et al., 2009). Both collections
have natural hierarchies consisting of fine-grained image classes
that can be grouped into more general classes.

AWA contains 30,475 images from 50 animal classes (e.g., bat and
deer). We use two datasets following the practice of (Hwang et al.,
2011). The first one, AWA-ATTR, has 85 features consisting of the
outputs of 85 linear SVMs trained to predict the presence/absence of
the 85 nameable properties annotated by (Lampert et al., 2009), like
red and furry. The second dataset, AWA-PCA, uses the provided
features (SIFT, rgSIFT, PHOG, SURF, LSS, RGB) after being
concatenated, normalized, and PCA-reduced to 100 dimensions. The
ground-truth hierarchy of AWA is shown in Fig. 2 of (Hwang et al.,
2011).

We use two datasets collected from ImageNet: VEHICLE contains 20
vehicle classes (e.g., cab and canoe) and 26,624 images (Hwang
et al., 2011), and IMAGENET consists of 28,957 images spanning 20
non-animal, non-vehicle classes (e.g., lamp and drum) (Hwang et al.,
2012). The raw image features are the provided bag-of-words
histograms obtained by SIFT (Deng et al., 2010; 2009). We also
project them down to 100 dimensions with PCA. The semantic
hierarchies of VEHICLE and IMAGENET are given in Fig. 3 of (Hwang
et al., 2011) and Fig. 2(e) of (Hwang et al., 2012), respectively.

Baselines: We compare HMMC with four sets of baselines. The first
set consists of the flat clustering methods k-means (KM), spectral
clustering (SC) (Ng et al., 2001), and an MMC approach implemented
in (Zhou et al., 2013).

The second set is hierarchical bottom-up clustering (HBUC). We have
tested a variety of methods including Single-Link (SL), Average-Link
(AL) and Complete-Link (CL) (Manning et al., 2008). The pairwise
dissimilarity between two images is measured by Euclidean distance.

The third set is hierarchical top-down clustering methods (HTDC). We
derive variants of hierarchical k-means (HKM) and hierarchical
spectral clustering (HSC) directly from our HMMC approach. HKM and
HSC apply the same greedy top-down approach as HMMC, but split a
given node using k-means and spectral clustering, respectively.
Similar to HMMC, HKM and HSC first try splitting all the current
leaf nodes, and then greedily grow the leaf with the best splitting.
The splitting score on a leaf node is defined as the average
within-cluster distance; minimizing this gives the most compact
clusters. We also considered two other baselines, HKM-D and HSC-D.
Instead of growing the leaf with the most compact clusters, HKM-D
and HSC-D grow the leaf with the most scattered data, which is
defined as the total distance of all instances to their center.

The fourth set of baselines are variants of HMMC. We change the
regularization to derive HMMC-G (group sparsity only), HMMC-E
(exclusive sparsity only), HMMC-1 (basic $\ell_1$-norm), and HMMC-2
(squared $\ell_2$-norm).

Parameters: For a fair comparison of all the hierarchical top-down
clustering methods, we apply the same stopping criterion: we test if
the number of leaf nodes exceeds a fixed limit F. Empirically, we
set F as 1, 1.5 and 2 times the number of ground-truth classes in
each dataset. The number of splits on each node also has a great
impact on the learned hierarchy. To compare different hierarchical
clustering methods, we simply use K-nary branching for all splits in
all hierarchies. We experiment with K set as 2, 3, 4 and 5,
respectively. With a particular setting of F and K, we can fairly
compare different hierarchical clustering methods since they perform
the same number of splits and obtain the same number of leaf nodes.
We use the same solver for learning HMMC and its variants, and
report the best performance with both $\alpha$ and $\beta$ selected
from the range $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$.

For the HBUC baselines, we apply the same F parameter as above.
However, all the HBUC methods use binary branching and there is no
result for K larger than 2.

For the flat clustering methods (i.e., KM, SC and MMC), we set the
number of clusters to F to fairly compare performance with
hierarchical methods. For SC, we use a 5-nearest neighborhood graph
and set the width of the Gaussian similarity function as the average
distance over all the 5-nearest neighbors. This also applies in HSC
and HSC-D which use SC for splitting a node.

Performance Measures: We evaluate all the methods by three
performance measures. The first two are semantic measures focusing
on how well the learned hierarchy captures the semantics in the
ground-truth hierarchy. The motivation is that two semantically
similar images should be grouped in the same or nearby clusters in
the learned hierarchy, and two semantically dissimilar images should
be split into clusters that are far away from each other.

For each pair of images, we compute their semantic similarity from
the ground-truth hierarchy, using the following two metrics. The
first, the shortest path metric (Harispe et al., 2013), finds the
shortest path linking the two image classes in the ground-truth
hierarchy, normalizes the path distance by the maximum distance, and
subtracts the distance from 1 as the semantic similarity. The
second, the path sharing metric (Fergus et al., 2010), counts the
number of nodes shared by the parent branches of the two image
classes, normalized by the length of the longest of the two
branches. Note that we can similarly define the shortest path
similarity and the path sharing similarity using the learned
hierarchy, for any pair of images, by checking the leaf node(s)
where the two images are clustered. For flat clustering with no
hierarchy, we simply set the similarity to 1 if two images are from
the same cluster, and 0 otherwise.

                              AWA-ATTR                             AWA-PCA
Group    Method         SP       PS       RI  runtime       SP       PS       RI  runtime
FLAT     KM          77.95    92.83    96.04      5.2    77.48    91.34    94.50      7.4
         SC          77.90    92.54    95.71    209.2    77.16    88.72    91.21    172.2
         MMC         77.08    83.67    83.71  15957.7    77.32    90.59    93.54   6077.0
HBUC     SL          63.97    16.24     2.65     88.3    64.00    16.24     2.69     80.9
         AL          74.55    38.99    32.84     72.1    64.37    17.00     3.94     80.9
         CL          92.60    87.54    93.33     81.8    68.14    22.63    34.27     47.6
HTDC     HKM         71.95    40.46    30.02      1.6    85.00    76.86    79.77      3.2
         HSC         81.59    69.84    67.43    247.0    79.47    47.69    57.25    873.2
         HKM-D       92.59    91.01    95.97      5.8    91.43    88.24    95.02      2.4
         HSC-D       94.18    90.38    95.98    293.4    79.94    48.02    57.97    873.4
         HMMC        94.40    91.03    95.96   1986.9    93.69    89.66    95.65   1550.1
VARIANT  HMMC-G      94.36    90.83    95.87   1389.6    93.77    89.59    95.56   1254.2
         HMMC-E      93.81    90.74    95.45    788.2    93.11    89.39    94.79   1408.8
         HMMC-1      87.77    77.49    77.84    558.1    92.01    89.69    95.49    769.9
         HMMC-2      92.70    90.92    95.99    893.2    93.65    89.47    95.25   1449.1

                              VEHICLE                              IMAGENET
Group    Method         SP       PS       RI  runtime       SP       PS       RI  runtime
FLAT     KM          75.44    76.76    78.08      2.9    79.66    82.03    87.14      4.0
         SC          74.15    74.00    74.12    112.0    69.39    67.79    61.25    137.3
         MMC         78.03    84.23    88.49   1366.6    79.98    82.74    89.24   2634.3
HBUC     SL          58.21    29.12     5.30     61.2    45.85    32.97     5.24     84.5
         AL          58.24    29.17     5.45     47.8    45.87    33.00     5.29     49.0
         CL          58.30    29.26     5.58     54.2    46.46    33.62     6.59     69.3
HTDC     HKM         77.68    65.75    59.75      2.2    82.85    80.93    84.67      1.9
         HSC         68.59    45.84    36.62    745.4    64.76    52.69    53.64    913.6
         HKM-D       84.89    74.77    85.37      1.9    81.42    80.87    86.60      3.2
         HSC-D       69.29    46.07    37.11    316.9    48.19    37.01    11.00    892.9
         HMMC        90.48    85.08    90.16    994.3    86.94    84.63    90.59   1411.6
VARIANT  HMMC-G      90.40    85.03    90.10    883.6    86.69    84.33    90.56   1016.4
         HMMC-E      87.82    83.00    86.26    616.3    85.77    83.98    89.06   1658.4
         HMMC-1      89.94    84.72    89.54    486.9    86.81    84.52    90.52    690.1
         HMMC-2      90.05    84.71    89.68    541.3    86.13    84.08    89.94   1158.3

Table 1. Clustering performance on the four datasets. SP, PS and RI
are reported in percentage, and the boldfaced numbers achieve the
best performance among flat and hierarchical methods (excluding
HMMC variants). The runtime (in seconds) is measured on a machine
with Intel Xeon 2.8GHz CPU and 16GB memory.

Figure 2. Using different settings of F and K. (a-d) plot against F
while fixing K = 2, and (e-h) plot against K while fixing F as the
number of ground-truth classes in each dataset.

To measure the goodness of the learned hierarchy in capturing 
semantics, we compute the mean squared error between the 
learned similarity and the ground-truth semantic similarity 
over all pairs of images, and subtract the mean squared error 
from 1 as our semantic measure. Note that we have two 
semantic measures, the shortest path similarity (SP) and the 
path sharing similarity (PS). The higher the values, the better 
the performance. 
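
A minimal Python sketch of this measure is given below, assuming the pairwise similarities are stored as dictionaries keyed by image-index pairs. The function names, the toy labels and the toy ground-truth values are hypothetical and chosen only for illustration; the flat-clustering similarity (1 for the same cluster, 0 otherwise) follows the text, while the SP and PS similarities themselves are not reproduced here.

```python
from itertools import combinations


def flat_similarity(labels):
    """Pairwise similarity for flat clustering: 1 if same cluster, else 0."""
    return {
        (i, j): 1.0 if labels[i] == labels[j] else 0.0
        for i, j in combinations(range(len(labels)), 2)
    }


def semantic_measure(learned, truth):
    """Return 1 minus the mean squared error over all pairs in `truth`."""
    pairs = list(truth)
    mse = sum((learned[p] - truth[p]) ** 2 for p in pairs) / len(pairs)
    return 1.0 - mse


# Toy example (hypothetical data): four images assigned to two clusters.
labels = [0, 0, 1, 1]
learned = flat_similarity(labels)

# Hypothetical ground-truth semantic similarity, with values in [0, 1].
truth = {
    (0, 1): 1.0, (0, 2): 0.5, (0, 3): 0.0,
    (1, 2): 0.5, (1, 3): 0.0, (2, 3): 1.0,
}

score = semantic_measure(learned, truth)
print(f"semantic measure: {score:.4f}")
```

Higher scores indicate closer agreement between the learned and ground-truth similarities, matching the convention used for SP and PS.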

Moreover, we also report the Rand Index (RI) (Rand, 
1971), which evaluates the percentage of true positives 
within clusters and true negatives between clusters. Note 
that RI is a commonly-used measure for flat clustering. For 
hierarchical clustering methods, we simply ignore the hierarchy 
and evaluate RI on the leaf node clustering (allowing 
direct comparisons with flat clustering methods). 
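
For concreteness, the Rand Index can be sketched in a few lines of Python as the fraction of sample pairs on which the predicted and ground-truth labelings agree (both in the same cluster, or both in different clusters). The function name and the toy labelings are assumptions made for this example; for a hierarchy, `pred` would hold the leaf-node assignment of each sample.

```python
from itertools import combinations


def rand_index(pred, truth):
    """Fraction of sample pairs on which the two labelings agree."""
    if len(pred) != len(truth) or len(pred) < 2:
        raise ValueError("need two labelings of equal length >= 2")
    agree = 0
    total = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        # A pair counts as a true positive or a true negative when
        # both labelings make the same same/different decision.
        if same_pred == same_true:
            agree += 1
        total += 1
    return agree / total


# Toy example (hypothetical labels): five samples.
pred = [0, 0, 1, 1, 1]
truth = [0, 0, 0, 1, 1]
ri = rand_index(pred, truth)
print(f"Rand Index: {100 * ri:.1f}%")
```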

5.1. Results 

Comparing flat and hierarchical methods: Due to space 
limitations, we only report the clustering results with F 
equal to the number of ground-truth classes and K = 2 
(binary splitting). The results are listed in Table 1, which 
shows that HMMC achieves the best performance on AWA-PCA, 
VEHICLE and IMAGENET, and competitive results 
on AWA-ATTR. Specifically, in terms of the semantic measure 
SP, HMMC improves over the second best by about 0.2, 2, 6 
and 4 percentage points on AWA-ATTR, AWA-PCA, VEHICLE 
and IMAGENET, respectively. This verifies that 
HMMC better captures the semantics in the clustered 
hierarchies. 
Figure 3. (Best viewed in color.) The learned hierarchy on AWA-ATTR with binary branching. Here we show the results on the first 
four layers. For each node, we visualize three majority image classes (with the number of images from each class listed below the 
sample image), and the three most discriminative attributes (ranked by the magnitude of the regularization on the corresponding feature 
dimension). 


Moreover, HMMC outperforms other HTDC baselines in 
most cases, showing the effectiveness of our greedy top-down 
algorithm for hierarchy building and our alternating 
descent algorithm for splitting data on a given node. Note 
that the HBUC baselines tend to perform worse since they 
typically produce extremely unbalanced clusters at the top 
levels (e.g., a child contains only one sample). They also 
do not lead to semantically meaningful hierarchies. 

Using different F and K: We also vary the parameters 
F (i.e., the number of leaf nodes) and K (i.e., the number 
of splits), and plot the SP performance in Fig. 2. Here 
we have omitted the poor results of hierarchical bottom-up 
methods for better visualizations. HMMC consistently 
outperforms the other baselines on AWA-PCA, VEHICLE and 
IMAGENET, and is comparable with HKM-D and HSC-D 
on AWA-ATTR. Note that the performance of HMMC is 
stable with regard to the different settings of F and K. 

Comparing the variants of HMMC: Table 1 shows that 
HMMC gets slightly better performance than the four 
variants of HMMC in most cases. This is reasonable since HMMC 
produces sparse models that may better capture semantics. We also 
compare the model sparsity (i.e., the percentage of zeros 
in the learned models) in Fig. 4. Here we omit HMMC-2 
since the model is always non-sparse. For a fair comparison, 
we fix the trade-off parameters to 1 in all models. Note 
that by combining the grouping and exclusive regularizers, 
HMMC is sparser than HMMC-G and HMMC-E. HMMC-1 
sometimes has slightly better sparsity than HMMC, but 
the performance is limited due to the lack of semantics. 

Figure 4. HMMC model sparsity. See text for details. 
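
The sparsity measure used in this comparison (the percentage of zeros in a learned model) can be sketched as follows. This is only an illustration under stated assumptions: the model is represented as a plain list of weight rows, and the function name, the `tol` threshold and the toy weights are hypothetical rather than taken from our implementation.

```python
def sparsity(weights, tol=0.0):
    """Percentage of entries whose magnitude is at most `tol`."""
    flat = [w for row in weights for w in row]
    if not flat:
        raise ValueError("the model has no weights")
    zeros = sum(1 for w in flat if abs(w) <= tol)
    return 100.0 * zeros / len(flat)


# Toy model (hypothetical): two weight vectors over four features.
W = [
    [0.0, 1.2, 0.0, -0.7],
    [0.0, 0.0, 0.3, 0.0],
]
pct = sparsity(W)
print(f"model sparsity: {pct:.1f}%")
```

A small positive `tol` may be preferable in practice when an optimizer returns weights that are numerically close to, but not exactly, zero.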

Runtime comparison: Table 1 also reports the runtime 
results. Our implementation of HMMC is between 1.4 and 8 
times faster than MMC, showing the efficiency of the 
hierarchical method. Note that HMMC is more expensive 
than other hierarchical and flat methods. This is reasonable 
since HMMC needs to solve a more expensive optimization 
problem during clustering. 
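
The speedup range quoted above can be checked directly from the runtime columns of Table 1. The short Python sketch below divides the MMC runtime by the HMMC runtime on each dataset; the variable names are illustrative, and the numbers are copied from the table.

```python
# Runtimes in seconds, copied from the MMC and HMMC rows of Table 1.
mmc_runtime = {
    "AWA-ATTR": 15957.7, "AWA-PCA": 6077.0,
    "VEHICLE": 1366.6, "IMAGENET": 2634.3,
}
hmmc_runtime = {
    "AWA-ATTR": 1986.9, "AWA-PCA": 1550.1,
    "VEHICLE": 994.3, "IMAGENET": 1411.6,
}

# Speedup of HMMC over MMC on each dataset.
speedup = {name: mmc_runtime[name] / hmmc_runtime[name] for name in mmc_runtime}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

The smallest ratio is obtained on VEHICLE and the largest on AWA-ATTR.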

Visualizations: Fig. 3 visualizes the learned hierarchy on 
AWA-ATTR. See the caption for details. Our model captures 
semantically meaningful attributes in building the 
hierarchy: note how the attribute “quadrapedal” is used to 
separate “whales” and “polar bears”, and how “longneck” 
is used to divide “rhinos” and “giraffes”. 
6. Conclusion 

We have presented a hierarchical clustering method for 
unsupervised construction of taxonomies. We develop a 
greedy top-down splitting criterion, and use the grouping 
and exclusive regularizers for building semantically meaningful 
hierarchies from unsupervised data. Our method 
makes use of maximum-margin learning, and we propose 
effective algorithms to solve the resultant non-convex 
objective. We test our method on four standard datasets, 
showing the efficacy of our method in clustering, and the 
ability to capture semantics via the hierarchies. 